RuntimeError: ACL stream synchronize failed, error code:507015 when running the model training performance test · Issue #I5UI3H · Ascend/modelzoo


2024-03-09 17:58 | Source: compiled from the web

1. Symptom (with error log context):

Traceback (most recent call last):
  File "tools/train_net.py", line 195, in <module>
    args=(args,),
  File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net.py", line 183, in main
    return trainer.train()
  File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/defaults.py", line 422, in train
    super().train(self.start_iter, self.max_iter)
  File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/train_loop.py", line 162, in train
    self.run_step()
  File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/train_loop.py", line 262, in run_step
    scaled_loss.backward()
  File "/usr/local/python3.7.5/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/handle.py", line 142, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 397, in post_backward_with_master_weights
    self._amp_combined_init()
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 661, in combined_init_with_master_weights
    stash.main_fp16_grad_combine, stash.fp16_grad_list = get_grad_combined_tensor_from_param(stash.all_fp16_params)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 34, in get_grad_combined_tensor_from_param
    original_combined_tensor = combine_npu(list_of_grad)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/contrib/combine_tensors/combine_tensors.py", line 27, in combine_npu
    combined_tensor = torch.zeros(total_numel, dtype=dtype).npu()
RuntimeError: ACL stream synchronize failed, error code:507015
THPModule_npu_shutdown success

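2. Software versions:
-- CANN version: 6.0.RC1
-- Firmware/driver version: 5.1.rc2
-- PyTorch version: torch1.5-20211229
-- Python version: Python 3.7.5
-- OS version: Ubuntu 20.04.1 LTS
-- Architecture: x86
-- Model script: Cascade_RCNN

3. Hardware:
Huawei A800-9010, 910B
CPU information
Memory usage
Disk usage

4. Logs:
Full log 1
Full log 2

5. Steps tried:
Switched versions:
-- CANN version: 5.0.3
-- Driver version: 21.0.3.2
-- Firmware version: 1.79.22.7.220
-- PyTorch version: torch1.5-20211229
The run still fails with RuntimeError: ACL stream synchronize failed, error code:507015, the same as before.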
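Since both software configurations report the same error code, one way to compare the two full logs is to pull every ACL error code out of each and count the occurrences. The following is only a minimal sketch for log triage: the helper names extract_acl_error_codes and summarize_error_codes are hypothetical, and the regular expression assumes the "error code:<number>" wording seen in the RuntimeError message above. It does not diagnose the cause of the failure.

```python
import re
from collections import Counter

# Assumption: codes appear in the log as "error code:<digits>",
# matching the wording of the RuntimeError in this report.
ERROR_CODE_RE = re.compile(r"error code:\s*(\d+)")


def extract_acl_error_codes(log_text):
    """Return every error code found in log_text, in order of appearance."""
    return [int(code) for code in ERROR_CODE_RE.findall(log_text)]


def summarize_error_codes(log_text):
    """Count how often each error code appears in a log."""
    return Counter(extract_acl_error_codes(log_text))


if __name__ == "__main__":
    sample = "RuntimeError: ACL stream synchronize failed, error code:507015"
    for code, count in summarize_error_codes(sample).items():
        print(code, count)
```

Running the same helper over the logs from both configurations would show whether 507015 is the only code reported or whether an earlier, different code precedes it.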


